Some basic information about the data:
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
The dataset contains 4898 observations with 12 features. Quality is an integer value, but apart from that all other features are numeric (float) values. The mean quality is 5.878 and median quality is 6. The highest quality is 9 and the lowest 3.
Fixed acidity ranges from 3.8 to 14.2, with a median value of 6.8. About 75% of the wines have a volatile acidity of at most 0.32, and about 75% have a citric acid content of at most 0.39. For both free and total sulfur dioxide, there is a large gap between the minimum and maximum values (2.0 vs. 289.0 for free sulfur dioxide and 9.0 vs. 440.0 for total sulfur dioxide). There is less variation in the density of the wines, with a minimum of 0.9871, a maximum of 1.0390 and a median of 0.9937. pH values range from 2.72 to 3.82. Alcohol content varies between 8 and 14.2, with a median of 10.4.
First, I look at the distribution of quality ranks.
## [1] 4898
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
There are 4898 wines in the data set. Tabulating the values shows that each tail is thinly populated: only 20 observations have the lowest quality (3) and only 5 have the highest quality (9). By far the most common judged quality is 6, with 2198 out of 4898 observations. The histogram of the distribution suggests that quality is roughly normally distributed.
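As a rough cross-check, the total, the share per quality level and the mean quality can be recomputed from the frequency table alone. The sketch below retypes the counts printed above; the object names (`quality_counts`, `share`, `mean_quality`) are illustrative and not taken from the original analysis code.

```r
# Counts copied from the frequency table above (quality levels 3 to 9).
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)
levels_num <- as.numeric(names(quality_counts))

# Total number of wines and percentage of wines at each quality level.
n_total <- sum(quality_counts)
share <- round(100 * quality_counts / n_total, 1)

# Mean quality as a count-weighted average of the quality levels.
mean_quality <- sum(levels_num * quality_counts) / n_total

print(n_total)                  # 4898
print(share)
print(round(mean_quality, 3))   # 5.878, matching the summary output
```

The weighted mean agrees with the 5.878 reported by `summary()`, and quality level 6 alone accounts for a little under 45% of the wines.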
Next, I get a sense of the distribution of the other variables.
It seems like many of the variables are somewhat normally distributed (although the binwidths are not adjusted).
The data set includes a description of each variable, and I decide to take a closer look at the distribution and relevant statistics for each variable. I also try to adjust binwidths very roughly in the different plots by looking at the scale on the x axis.
Fixed acidity is the amount of tartaric acid in the wine, measured in grams per litre (g / dm^3). Tartaric acid is one of the main acids found in wine, and is the source of “wine diamonds”, the small potassium bitartrate crystals that sometimes form spontaneously on the cork or at the bottom of the bottle. Chemically, tartaric acid lowers the pH during the fermentation process to a level where spoilage bacteria cannot live.
As can be seen, fixed acidity appears to be more or less normally distributed. Summary table containing some main statistics for fixed acidity:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Volatile acidity is the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste. It is measured in g / dm^3 of acetic acid. Volatile acidity is roughly normally distributed, but has a much longer right tail, indicating outliers with a higher level of volatile acidity.
Summary table containing some main statistics for volatile acidity:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Citric acid is measured in g / dm^3. According to the description in the dataset documentation, it is usually found in small quantities, and can add ‘freshness’ and flavor to wines. Citric acid appears to be roughly normally distributed. There do, however, seem to be “spikes” at both 0.5 grams and 0.75 grams - maybe due to rounding?
Summary table containing some main statistics for citric acid:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Residual sugar is the amount of sugar remaining after fermentation stops. It is measured in g / dm^3. Wines rarely contain less than 1 g of residual sugar per litre. If a wine contains over 45 grams per litre, it is typically considered sweet.
Summary table containing some main statistics for residual sugar:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
After adjusting the binwidth, I’m intrigued by the “residual sugar” distribution, as it does not seem to be normally distributed.
Looking more closely at the distribution of residual sugar, this time including outliers:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The wines differ quite a bit in residual sugar content. The mean residual sugar value is 6.391 and the median 5.2. The minimum value is as little as 0.6, while the maximum is 65.8. The 1st quartile value is 1.7 while the 3rd quartile value is 9.9. Plotting the distribution reveals that most wines have a residual sugar content below 20, with a spike between 1 and 2.
There appear to be some outliers to the far right in the plot, so I make a new plot where I zoom in to get a closer look. It reveals that only five wines have a residual sugar value above 25, and only three have a value over 30 (two with 31.60 and one with 65.80).
I further table the wines with residual sugar values over 25:
Residual sugar values of wines with residual sugar > 25:
## [1] 31.60 31.60 65.80 26.05 26.05
Quality values of wines with residual sugar > 25:
## [1] 6 6 6 6 6
All of these wines have a judged quality of 6, which is the most common quality level, so they don’t stand out quality-wise. Recalling that wines with a residual sugar value over 45 g per liter are considered sweet, it seems that the Vinho Verde wines are not particularly sweet, with only one wine in the dataset being “sweet”. Since all the outliers are deemed to be of quality 6, they do not seem to affect the judged quality in either direction.
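The outlier lookup described above can be sketched with a plain `subset()` call. The full data file is not reproduced here, so the small data frame below is an assumption for illustration: it combines the first ten rows shown in the `str()` output with the five high-sugar rows reported in the text (these rows are not adjacent in the real data). The name `wine_sample` is hypothetical.

```r
# Illustrative sample only: first ten rows from the str() output plus the
# five high-sugar wines reported in the text (all with quality 6).
wine_sample <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5,
                     31.60, 31.60, 65.80, 26.05, 26.05),
  quality = rep(6L, 15)
)

# Wines with residual sugar above 25 g / dm^3.
high_sugar <- subset(wine_sample, residual.sugar > 25)
print(high_sugar$residual.sugar)
print(high_sugar$quality)

# Wines above the 45 g / dm^3 threshold at which a wine counts as sweet.
n_sweet <- sum(wine_sample$residual.sugar > 45)
print(n_sweet)
```

On the full data set the same two expressions would be applied to the complete data frame instead of `wine_sample`.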
Chlorides represent the amount of salt in the wine, measured in grams per liter (g / dm^3). The distribution appears to be bimodal, with a long right tail.
Summary table containing some main statistics for chlorides:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Most wines have a salt content of between around 0.036 grams and 0.05 grams per liter.
Free form SO2 is measured in milligrams (mg) per liter. SO2 is used as an additive, as it prevents microbial growth and the oxidation of wine. The distribution of free sulfur dioxide appears to be normal, with some outliers to the right.
Summary table containing some main statistics for free sulfur dioxide:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
Total sulfur dioxide is the sum of the free and bound forms of SO2 and, like free sulfur dioxide, is measured in mg per litre. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of the wine. Total sulfur dioxide is roughly normally distributed.
Summary table containing some main statistics for total sulfur dioxide:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Density is measured in grams per cubic centimetre (g / cm^3), which is consistent with the values close to 1 in the summary table. The density of the wine depends on alcohol content and residual sugar content. It is (very roughly) normally distributed.
Summary table containing some main statistics for density:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
pH describes how acidic or basic a substance is on a scale from 0 (very acidic) to 14 (very basic). Most wines have a pH value between 3 and 4. It is normally distributed.
Summary table containing some main statistics for pH values:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
As can be seen from the summary table, most wines in this data set also have a pH value between 3 and 4, although the lowest values fall slightly below 3 (the minimum is 2.72).
Sulphates (potassium sulphate) are a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. They are measured in grams per litre. The distribution is roughly normal, with a longer right tail.
Summary table containing some main statistics for sulphates:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Next, I look more closely at the distribution of alcohol content. The minimum value is 8, and the maximum value 14.2. The median and mean values are 10.4 and 10.51 respectively.
Summary table containing some main statistics for alcohol:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
I’m intrigued by the “spikes” in the distribution of alcohol level, and add breaks to the x axis to see where they occur.
I wonder if the small “spikes” in the distribution coincide with round numbers (which may be caused by the manufacturers reporting rounded rather than exact values). I create a modulo function to calculate the ending decimal in order to get an impression of whether or not numbers are rounded. With this function I create a new variable, the ending decimal of the alcohol content of each wine. I then create a histogram to plot the distribution of the ending decimals.
In this data set, alcohol values seem to be stated in increments of 0.1. As shown in the plot above, I would argue that there is a higher frequency of wines with alcohol content corresponding to “round numbers”, with an ending decimal of 0 or 5. There are for instance 650 wines with a stated alcohol content with an ending decimal of 0, compared to 410 ending in 1 and 387 ending in 9. My guess is that some producers round the stated alcohol value to a round number.
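The original modulo function is not shown, so the following is only a sketch of one way to extract the ending decimal. It assumes values are recorded to one decimal place and rounds before taking the remainder, because a naive `x %% 1` suffers from floating-point error (for example `10.1 %% 1` is not exactly `0.1`). The function name `ending_decimal` is hypothetical; the input is the first ten alcohol values from the `str()` output.

```r
# Ending decimal of a value stated to one decimal place, e.g. 9.5 -> 5.
# Multiplying by 10 and rounding first avoids floating-point artefacts.
ending_decimal <- function(x) {
  round(x * 10) %% 10
}

# First ten alcohol values as printed in the str() output above.
alcohol_head <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

decimals <- ending_decimal(alcohol_head)
print(decimals)   # 8 5 1 9 9 1 6 8 5 0

# Frequency of each ending decimal; on the full data this is what the
# histogram of ending decimals summarises.
print(table(factor(decimals, levels = 0:9)))
```

Any value recorded with more than one decimal place would be rounded to the nearest tenth by this function, so it is only appropriate under the one-decimal assumption stated above.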
In this case, such possible inaccuracies may not be of much significance, but each one has the potential to slightly affect all other analysis done on the data set (for example fitting linear models).
I will look more closely into the relation between alcohol content and other variables in the bivariate and multivariate analysis.
The dataset contains 4898 observations with 12 features. Quality is an integer value, but apart from that all other features are numeric (float) values.
Mean quality is 5.878 and median quality is 6. Although the quality scale runs from 0 to 10, the highest observed quality is 9 and the lowest 3.
Quality is the main feature of interest in this dataset.
I’m open-minded as to which other features will support my investigation into quality. I have no particular knowledge of wine chemistry, and at the beginning of the investigation I do not have any intuition as to which variables correlate with higher quality rankings.
I created a variable called alcohol.ending.decimal, which is the ending decimal of the stated alcohol content (alcohol content is stated in increments of 0.1 %). I used the variable to plot the distribution of ending decimals to see if there was a higher occurrence of “rounded” numbers with regard to alcohol content, which I believe there is. Since I did not plan on using the variable any further, I dropped it from the data set after conducting the analysis.
As mentioned above in connection with the univariate plots most of the variables seemed to have roughly normal distributions, albeit with very long tails. Residual sugar and alcohol did not seem to be normally distributed.
I have not yet needed to tidy or rearrange the data. The data seems relatively tidy, with each variable as a column and each observation as a row. The only change I made was to remove the X column, since R already numbers each observation (row).
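Dropping an index column such as X takes a single assignment in base R. The sketch below uses a three-row stand-in for the real data frame (the values are the first three rows from the `str()` output, and the name `wine_head` is hypothetical) and assumes X simply holds the row number.

```r
# Three-row stand-in for the wine data, with an X column of row numbers.
wine_head <- data.frame(
  X = 1:3,
  fixed.acidity = c(7, 6.3, 8.1),
  alcohol = c(8.8, 9.5, 10.1)
)

# Assigning NULL removes the column; the row names still number the rows.
wine_head$X <- NULL
print(names(wine_head))
print(rownames(wine_head))
```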
I started my bivariate analysis by using ggpairs to get an overview of how the different variables relate to each other.
From the plot, I note that alcohol and density seem to have some degree of correlation with the other variables in the data set, but that otherwise there does not seem to be much correlation between the variables.
The main feature of interest is how the features of the data set relate to quality. I am therefore particularly interested in identifying features that are related to quality, and use this as a starting point for my analysis.
I decided to look at the relation between the different features and quality with boxplots, since they give an indication of the distribution of each variable at each quality level. I therefore plot boxplots of all variables against quality by using grid.arrange:
From running ggpairs to produce a scatter matrix, I recall that alcohol did have the highest correlation with quality (0.4355747). I want to look further into the relation between alcohol and quality. The below plot shows alcohol level by quality.
## [1] 0.4355747
There seems to be a tendency for lower quality wines to have lower alcohol content and better quality wines to have higher alcohol content. That being said, there seems to be quite a bit of variance - for example, the lower quality wines seem to vary considerably with regard to alcohol content.
Also density looks promising with regard to correlation (-0.3071233) and merits a closer look:
## [1] -0.3071233
I also want to investigate the relation between pH values and quality:
## [1] 0.09942725
The correlation between pH and quality is 0.09942725.
Except for the very highest and the very lowest quality wines, mean pH across quality groups seems to be relatively similar.
However, the shape of the distribution seems to be slightly different, which is more visible if I use facet wrap to create a separate pH density plot for each level of quality:
Very low quality wines seem to vary a lot with regard to pH values, whereas the highest quality wines tend to have pH values more clustered together. I wonder whether there is some kind of relation here, or whether it is simply a result of there being few observations at the extreme ends of the quality spectrum.
Alcohol content seems to be related to other features, such as density (-0.7801376), which I want to look further into:
## [1] -0.7801376
There also appears to be a correlation between density and residual sugar level (0.8389665), which I want to investigate further:
## [1] 0.8389665
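The correlation values quoted in this section are presumably Pearson coefficients, which is the default of R's `cor()`. As a sketch of what that number measures, the block below computes the coefficient by hand and compares it with `cor()`. It uses only the first five density and residual sugar values visible in the `str()` output, so it cannot reproduce the 0.8389665 obtained on all 4898 wines; `pearson_r` is a hypothetical helper name.

```r
# First five rows as printed in the str() output above.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5)
density <- c(1.001, 0.994, 0.995, 0.996, 0.996)

# Pearson correlation: sum of products of the centred values, divided by
# the square root of the product of the two sums of squared deviations.
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual <- pearson_r(residual_sugar, density)
r_builtin <- cor(residual_sugar, density)   # method = "pearson" by default

print(c(manual = r_manual, builtin = r_builtin))
```

Both values agree, and even in this tiny sample the sign is positive: wines with more residual sugar tend to be denser.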
Plot: Density by residual sugar
Alcohol also seems to be related to SO2 levels. The plot below shows the relation between alcohol and total.sulfur.dioxide.
## [1] -0.4488921
The plot below shows the relation between total.sulfur.dioxide and density. From running ggpairs, I know they have one of the strongest correlations between the variables:
## [1] 0.5298813
The correlation between density and total sulfur dioxide is 0.5298813.
Total.sulfur.dioxide and free.sulfur.dioxide seem to have some degree of correlation (0.615501), and I want to examine this in closer detail.
## [1] 0.615501
Sulfur dioxide (SO2) protects wine from oxidation and bacteria. However, too much of it can impact taste.
From this research I understand that free and total sulfur dioxide levels are related. This leaves me curious as to whether the proportion of free to total sulfur dioxide has an impact on quality.
I decide to create a new variable free.sulfur.dioxide.proportion which is free.sulfur.dioxide/total.sulfur.dioxide, and plot the density distributions by quality:
## [1] -0.1747372
## [1] 0.008158067
## [1] 0.1972141
The proportion of free.sulfur.dioxide to total.sulfur.dioxide does indeed correlate slightly more strongly with quality (0.1972141) than the other two values printed above (-0.1747372 and 0.008158067), but with a correlation of 0.1972141 it is still not a strong predictor of quality.
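Creating the derived variable is a single column-wise division. The sketch below applies it to the first ten free and total sulfur dioxide values from the `str()` output; `so2_head` is a hypothetical stand-in for the full data frame, and the new column name follows the one used in the text.

```r
# First ten rows of the two SO2 columns as printed in the str() output.
so2_head <- data.frame(
  free.sulfur.dioxide = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# Share of the total SO2 that is present in free form (between 0 and 1,
# since free SO2 is one component of total SO2).
so2_head$free.sulfur.dioxide.proportion <-
  so2_head$free.sulfur.dioxide / so2_head$total.sulfur.dioxide

print(round(so2_head$free.sulfur.dioxide.proportion, 3))
```

On the full data set, the new column could then be passed to `cor()` together with quality to obtain the correlation discussed above.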
As stated in the univariate plots section, I started my analysis of which variables were important for wine quality with an open mind. I therefore decided to plot all the variables in boxplots with quality on the x axis. For several of the features, there seems to be a polynomial/quadratic relation between quality and the feature. This is for example the case with alcohol, where the highest and lowest quality wines have higher alcohol content, and the medium-low wines have lower alcohol content on average.
Alcohol seems to be related to several other variables. In particular, there seems to be a relation between alcohol and quality (the feature of interest). The correlation is 0.4355747.
## [1] 0.4355747
This trend can also be shown in a density plot:
Alcohol content seems to be related to other features, such as density (correlation of -0.7801376) and total.sulfur.dioxide (correlation of -0.4488921).
The strongest relationship I found was the relationship between density and residual sugar. The correlation between these two variables is 0.8389665.
Recalling that alcohol and density seemed to have a degree of correlation, I want to see how this relates to quality by adding color for quality:
From the above plot it seems that wines of higher quality are typically higher in alcohol and lower in density. Even though I have used lower alpha and jitter, it might, however, be argued that the plot is overplotted. I therefore decide to facet the plot by quality level:
For instance, wines of quality 4 and 5 seem more likely to be clustered in the upper left corner of the plot (lower alcohol content and higher density), while wines of quality 7, 8 and 9 are more likely to be clustered in the lower right part of the plot (higher alcohol content and lower density). That being said, while I would argue that there is such a general tendency, there is also a great deal of variance for each quality level. So while a lower quality wine is more likely to be in the upper left part of the plot, this does not mean that all low quality wines will necessarily be in that part of the plot.
There also appears to be some correlation between density and residual.sugar level, and I want to see how this relates to quality:
It appears that higher quality wines in general tend to have lower density and less residual sugar. In this plot, too, there is a danger of overplotting, so I decide to take a closer look at the distribution for each quality level by faceting by quality. I’m adding horizontal and vertical lines with the median values of density and residual sugar respectively to make it easier to see how the distribution is placed relative to the median values of each variable:
The plot below shows the relation between total.sulfur.dioxide and density. From running ggpairs, I know they have one of the strongest correlations between the variables. By adding color for quality, I want to see if there is some relation to quality:
From the plot, it appears to me that higher quality wines have lower density and lower total.sulfur.dioxide. I decide to facet the plot over different quality levels to see if the distribution differs for each quality level:
In the bivariate plots section, I looked at the relation between free and total SO2 levels. I want to have a look at this relation again, now adding color indicating quality level:
## [1] 0.615501
It appears that higher quality wines have less total.sulfur.dioxide and more free.sulfur.dioxide.
Since alcohol is the feature which in itself has the strongest relation with quality, I want to investigate the relation between alcohol and the free sulfur dioxide proportion and their relation to quality by adding color for quality:
Again, I facet by quality since the above plot may be overplotted:
From the above plot, it appears that higher quality wines have a higher level of alcohol, as well as a higher proportion of free sulfur dioxide.
In the multivariate analysis, I looked at the relation between alcohol and density, which seem to strengthen each other in terms of looking at quality. Higher quality wines are typically higher in alcohol and lower in density. The features density and residual sugar also seemed to strengthen each other in terms of looking at quality, with higher quality wines tending to have lower density and less residual sugar. This is also true for the relation between alcohol and the proportion of free sulfur dioxide. Higher quality wines have a higher level of alcohol, as well as a higher proportion of free sulfur dioxide.
I found it interesting that the distribution of pH values seemed to be so different across different quality levels. Given the low number of observations at the extreme ends of the quality spectrum, however, it is hard to say whether this is a result of a genuine difference between high and low quality wines, or whether it is particular just to this sample of white wines.
N/A
This plot shows the boxplot distributions of alcohol content for each quality level. A scatter plot with low alpha is added on top of the boxplot layer. For the three lowest quality levels (3-5), alcohol content seems to decrease with quality, i.e. the worst quality wines have a higher alcohol content than the better ones. For the higher quality wines, however, the opposite is true - from quality level 5 to 9, the alcohol level generally increases with quality. There therefore seems to be a polynomial/quadratic relation between quality and alcohol level.
The correlation value between alcohol and quality:
## [1] 0.4355747
Alcohol is the variable most strongly related to quality, with a correlation of 0.4355747. While this does not imply a very strong correlation, I would argue it is significant, and that alcohol level impacts the likelihood of a wine being deemed to be of good quality.
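One way to put the word "significant" on firmer ground is the standard t-test for a Pearson correlation, which needs only the coefficient and the sample size. The sketch below plugs in the reported r = 0.4355747 and n = 4898. It assumes independent observations and a roughly linear relation, neither of which is verified here, and it is not part of the original analysis.

```r
# Reported correlation between alcohol and quality, and the sample size.
r <- 0.4355747
n <- 4898

# Test statistic for H0: true correlation is zero; t has n - 2 degrees
# of freedom under the usual assumptions.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# Share of the variance in quality that a linear fit on alcohol explains.
r_squared <- r^2

print(round(t_stat, 2))
print(p_value)
print(round(r_squared, 3))
```

The t statistic comes out at roughly 34, so the correlation is very unlikely to be zero, while the squared correlation of about 0.19 shows that alcohol alone accounts for less than a fifth of the variance in quality. Statistical significance and strength of relation are separate questions.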
This plot shows the distribution of pH values across the groups of wines with the same quality. For the different quality levels, the distribution seems to vary. I have added a vertical line with x intercept at the mean pH value for all white wines to make comparisons across the different quality levels easier (since the mean is 3.188267 and the median is 3.18, I did not feel the need to add both).
The distribution of pH values seems to vary with the quality level. The lowest quality wines are distributed more evenly across different pH values, ranging from about 2.7 to 3.7. High quality wines, on the other hand, seem to be distributed across a narrower pH range, from about 3.15 to 3.45. Further, the lower quality wines (especially 4 and 5) seem to have more pH values below the mean, while higher quality wines (8 and 9) have pH values above the mean pH value. This might suggest that more acidic/sour wines are judged to taste less good.
The correlation values between pH and quality:
## [1] 0.09942725
The correlation between pH value and quality is 0.09942725, which does not imply a strong correlation. While I would argue that some interesting trends may be seen in the above plot, pH in itself is not a strong predictor of wine quality.
I have created a new feature which is the proportion of free sulfur dioxide to total sulfur dioxide. The plot is a scatter plot with the proportion of free sulfur dioxide on the x axis and alcohol content on the y axis.
The plot is faceted by quality level, in order to see if the distribution varies with quality. Further, color is added indicating quality level.
Finally, I have added the median values of both the x and y variables (the median across all quality levels) as dotted lines, in order to make it easier to compare the distribution for a given quality level with the median value.
From the above plot, it appears that higher quality wines have a higher level of alcohol, as well as a higher proportion of free sulfur dioxide, and are therefore more likely to be in the upper right part of the plot. Similarly, lower quality wines tend to have lower alcohol content and a lower proportion of free sulfur dioxide, and are therefore more likely to be in the lower left part of the plot. There is, however, a great deal of variance for each quality level.
The white wine data set contains information on 4898 white wine variants of the Portuguese “Vinho Verde” wine. My overall goal with the analysis was to uncover a relation between the different features and wine quality. I had no prior knowledge or intuition to guide me as to which features might be related to wine quality. I therefore started by analyzing the distribution of the different variables in more detail.
In the bivariate section, I started quite broadly by producing a scatter matrix which also showed the correlation values between the different variables. I produced boxplots in order to see how the different variables related to quality. I then chose to analyze the relation between the different features, primarily based on pairs with a stronger correlation. I mainly used scatter plots in this analysis, with added geom_smooth to give more information about the relation between the variables. I chose scatter plots (with lower alpha) because I feel they give a lot of information about the distribution, and also show where the density of data points is highest. For some variables (pH and proportion of free sulfur dioxide), where I believed the distribution would be of interest, I investigated the variables further using density plots.
In the multivariate section, I again used scatter plots to investigate how variables with some degree of relation to each other related to quality. I used color to indicate quality. However, since this led to a degree of overplotting, it was necessary to facet the plots by quality in order to get a clearer view of the different distributions by quality level.
From my analysis of the different features of the dataset, I would say there is a connection between some of the features and wine quality. Alcohol level in particular appears to be correlated with higher quality wines. However, even though there is some relation between the different features, it was not as pronounced as the strong, linear relation between price and carat in the diamonds dataset. I would say it was a bit disappointing not to uncover a stronger relationship. On the other hand, it would be surprising if something as complex as the subjective taste of wine could be broken down to 11 chemical properties. There are likely interactions between the chemical properties that all work together to produce the subjective experience of the wine.
Some of the analysis might be influenced by the fact that there are very few observations at the extreme ends of quality. For example, there are only 20 observations of wines judged to be of quality 3 and only 5 of wines judged to be of the highest quality, 9. The data set only covers wines from one region in Portugal. It would be interesting to investigate whether the findings would be different if wines from a different region or a range of regions were used. The data also seems to be limited to one year. It would be interesting to see year-on-year changes, particularly as one often hears wine producers talk about “good years” and “bad years”, and to see whether the chemical properties of the wine change from a good year to a bad year.